By Yukun & Mike
Part I
05/08/2019

The notebook server, not the kernel, is responsible for saving and loading notebooks, so you can edit notebooks even if you don’t have the kernel for that language—you just won’t be able to run code. The kernel doesn’t know anything about the notebook document: it just gets sent cells of code to execute when the user runs them.
TL;DR
We interact with the browser, which communicates with the notebook server, which is built on a kernel.
A kernel determines the language of the program you are running, e.g. Python, R, or Scala.

a="DAB"
print("I love DIL")
print(a)
| Appearance | State |
|---|---|
| [ ] | Cell has not been run yet |
| [number] | Cell has been executed |
| [*] | Cell is currently running |
Edit Mode: run code using the kernel. When you see the pencil symbol in the upper right of the browser and the cell border is green, you are in Edit Mode.
Command Mode: operate on the cells themselves, not on their content. The pencil is gone and the cell border is blue in this mode.
How to change modes efficiently?
In the table below, "CM+" means pressing the key while in Command Mode.
| Shortcut | Function |
|---|---|
| Esc | Go to Command Mode |
| Ctrl+Enter | Run the Cell |
| Shift+Enter | Run the Cell and Select the Cell Below |
| CM+A | Insert a Cell Above |
| CM+B | Insert a Cell Below |
| CM+M | Change the Current Cell to a Markdown Cell |
| CM+Y | Change the Current Cell to a Code Cell |
| CM+F | Find and Replace |
| CM+Shift+M | Merge the Selected Cells |
| Shift+Up/Down | Select Multiple Cells |
| Tab | Code Auto-complete (in Edit Mode) |
There are many formats for you to download.
File - Download As - ...
My Personal Favorites, Take 2:
pip install jupyter_contrib_nbextensions
or
conda install -c conda-forge jupyter_contrib_nbextensions
jupyter contrib nbextension install --user
My Personal Favorites, Take 3:
For example, let's try the Snippets
from __future__ import print_function, division
import numpy as np
You can run shell commands from Unix-like systems, such as Linux or macOS, in Jupyter Notebook too!
%ls
%pwd
These are some of the enhancements that IPython adds on top of normal Python syntax. All of them begin with %.
There are two kinds of magic functions: line magics (prefixed with a single %) and cell magics (prefixed with %%).
List all the magic Functions
%lsmagic
Running External Code
%run p.py
One of the most commonly used ones: %matplotlib inline
%matplotlib inline
If you want a vector image (one that won't blur when zooming), try this:
%config InlineBackend.figure_format = 'svg'
Calculate the Running Time of the Current Cell
%%time
a = 0
for i in range(220000):
    a += 1
There are several ways to get reference documentation inside the Jupyter Notebook.
?print
Docstring: print(value, ..., sep=' ', end='\n', file=sys.stdout, flush=False)
Prints the values to a stream, or to sys.stdout by default.
Optional keyword arguments:
file: a file-like object (stream); defaults to the current sys.stdout.
sep: string inserted between values, default a space.
end: string appended after the last value, default a newline.
flush: whether to forcibly flush the stream.
Type: builtin_function_or_method
So, the result would be the docstring of the function.
They have references for:
Usually, Jupyter displays the value of the last expression in a cell without calling print. (Of course, that's part of what makes it useful for interactive programming.)
a=6
a
By adding a semicolon to the end of the line, the value of the variable or statement will not be displayed.
a+6;
Markdown is a lightweight markup language with plain text formatting syntax. Its design allows it to be converted to many output formats, but the original tool by the same name only supports HTML.
TL;DR
Markdown is for text editing, and it can be converted to anything that supports HTML.
ALL THE TEXT IN THIS WORKSHOP DOCUMENT IS WRITTEN IN MARKDOWN SYNTAX
There are many markdown rules, and I'm also going to pick out my personal favorites. :)
# H1
## H2
### H3
#### H4
##### H5
###### H6
*Emphasis*, aka italics, with `*asterisks*` or `_underscores_`.
**Strong emphasis**, aka bold, with `**asterisks**` or `__underscores__`.
**Combined emphasis** with `**asterisks and _underscores_**`.
~~Strikethrough~~ uses two tildes: `~~Scratch this.~~`


The text in the [] is the name of the figure, while the URL should be put into the (). The hover title of the figure goes after the URL, with a whitespace in between.
But be careful when you insert images using Markdown, there are pitfalls...
| D | I | L |
|---|---|---|
| Digital | Innovation | Lab |
The first line is the header.
The second line indicates the alignment.
The values of the table begin at the 3rd line.
There must be pipes to segment each cell of the table.
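Putting those rules together, the raw Markdown for a minimal table like the one above looks like this:

```markdown
| D | I | L |
|---|---|---|
| Digital | Innovation | Lab |
```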
This section is basically adapted from this site
There is some other syntax, like citations and references... but I personally think that is cumbersome and not very useful for us. If you want to know more about Markdown, please Google some relevant material online.
P.S. These comprehensions can also be applied to dictionaries and tuples.
This assumes everyone has a basic grasp of lists, dictionaries, tuples, functions, loops, conditional statements, etc.
If you are not sure about these things, please feel free to ask 😊
Create a list from a sequence based on a condition
Syntax: [<expr> for <item> in <seq> if <cond>]
%%time
t = []
for i in range(100000):
    if i % 2 == 0:
        t.append(i)

%%time
t = [i for i in range(100000) if i % 2 == 0]
Imagine I have a list a = [1,2,3] and a list b = ["A","B","C"]. I want a new list in which each pair is combined into a new item. What should I do?
a = [1, 2, 3]
b = ["A", "B", "C"]
c = []
for itemfroma, itemfromb in zip(a, b):
    print(itemfroma, itemfromb)
    c.append((itemfroma, itemfromb))
print(c)
a={"d":1, "i":2, "l":3}
{value:key for key,value in a.items()}
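The same syntax works for dictionaries, as shown above. For tuples, one caveat worth knowing: wrapping a comprehension in parentheses produces a generator, not a tuple, so you need tuple() to materialize it. A quick sketch:

```python
# Parentheses give a generator expression, not a tuple
g = (i * 2 for i in range(3))
print(type(g).__name__)  # generator

# Wrap it in tuple() to get an actual tuple
t = tuple(i * 2 for i in range(3))
print(t)                 # (0, 2, 4)
```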

import re
a = "199987659955"
match = re.search(r'9+',a)
print (match.group())
match = re.search(r'19*',a)
print (match.group())
match = re.search(r'9*',a)
print (match.group())
match = re.search(r'9+.*9+',a)
print (match.group())
import re
match = re.search(r'pi+', 'piiig')
print (match.group())
match = re.search(r'i+', 'piigiiii')
print (match.group())
match = re.search(r'\d\s*\d\s*\d', 'xx1 2 3xx')
print (match.group())
match = re.search(r'\d\s*\d\s*\d', 'xx12 3xx')
print (match.group())
match = re.search(r'\d\s*\d\s*\d', 'xx123xx')
print (match.group())
import re
a = "A765-2781-ZFQ"
match = re.search(r'([AB])([0-9]+)-([0-9]+)-([A-Z0-9]+)',a)
print (match.group())
print (match.group(1))
print (match.group(2))
print (match.group(3))
print (match.group(4))
But we will see some of them in the following content 😉
This means everything we're going to learn is geared toward actual use, not a thorough understanding of the underlying mechanisms. But first of all:
What are Pandas and NumPy?
NumPy is a powerful Python library that expands Python's functionality by allowing users to create multi-dimensional array objects (ndarray). In addition to the creation of ndarray objects, NumPy provides a large set of mathematical functions that can operate quickly on the entries of the ndarray without the need for for loops.
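As a quick illustration of that last point (with made-up numbers), arithmetic on an ndarray applies element-wise, so no explicit loop is needed:

```python
import numpy as np

# A small example array (made-up data)
a = np.array([1, 2, 3, 4])

# Element-wise operations: no for loop required
print(a * 2)       # [2 4 6 8]
print(a + a)       # [2 4 6 8]
print(a.mean())    # 2.5
```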
The pandas (PANel + DAta) Python library provides easy and fast data analysis and manipulation tools via numerical tables and time-series data structures called DataFrame and Series, respectively. Pandas was created to do the following:
provide data structures that can handle both time-series and non-time-series data;
allow mathematical operations on the data structures, ignoring their metadata;
support relational operations like those found in languages like SQL (join, group by, etc.);
handle missing data.
TL;DR
# These abbreviations are conventions
import numpy as np
import pandas as pd
pandas supports basically all kinds of files, as long as they are formatted in a tabular way.
Let's have a look at our dataset. Remember the shell commands we mentioned before?
pd.read_csv("Static Tweets with Norm Loc.csv", index_col="Unnamed: 0").head(3)
db=pd.read_csv("Static Tweets with Norm Loc.csv", index_col="Unnamed: 0")
user=pd.read_csv("Users with Nor Loc.csv", index_col="Unnamed: 0")
.info() shows you the structure of the dataframe, including the index, the columns, and their datatypes. It is very common to use this function every time you begin processing a dataset.
db.info()
.describe() gives you the distributions of the numeric variables. It shows the count, the mean, the standard deviation, and some other metrics.
user.describe()
Rows are accessed by index.
Indexes are unique identifiers of the rows.
Usually, they are just numbers.
List all the indexes in the dataframe
db.index
If you want to select a row, use the .loc accessor.
db.loc["1"]
The loc attribute allows indexing and slicing that always references the explicit index. So you have to make sure the index you want to access is actually in the dataset. For example, in our dataset the index is a string, not an integer, so the following command returns an error.
db.loc[1]
Selecting rows is similar to operating on a NumPy array. That is to say, you can slice the index in multifarious ways.
#select row 2 to row 4
db.loc["2":"4"]
#select multiple rows at a time
db.loc[["6","2","8"]]
In summary, the thing inside .loc[] can only be one of these three things: a single label, a slice of labels, or a list of labels.
Besides using explicit labels, we can also use the positional index to access a row.
In this scenario, we use the .iloc accessor.
db.iloc[2]
In this situation, the thing inside iloc can only be one of the following: a single integer, a slice of integers, a list of integers, a Boolean array, or a callable.
We will talk about the last two later.
db.iloc[2:4]
db.iloc[[2,8,6]]
Each column in a pandas dataframe is called a Series. These are some common ways to access a Series:
db.from_user
db['from_user']
You can select rows and columns at the same time. One way is to use loc and iloc.
db.loc['2':'6',['from_user','time']]
db.iloc[[6,9,10],2:5]
You can also do it step by step: select a column, then pinpoint the index, or vice versa.
db['text']['2':'6']
db.loc['2':'6']['text']
Now, let's go for conditional selecting.
For example, we want the records with more than 1000 retweets.
db.retweet_count>1000
Now we have a Series of Boolean values. Next, we just need to use it as an index for selection.
db.loc[db.retweet_count>1000,:]
db.loc[lambda x: x.retweet_count>1000]
db.loc[(db.retweet_count>1000) & (db.favorite_count>10),['from_user',"retweet_count","favorite_count"]]
db.head(3)
db.tail(3)
db.iloc[:5].retweet_count
db.iloc[:5].retweet_count + 5
db.iloc[:5].retweet_count.count()
db.iloc[:5].retweet_count.min()
db.iloc[:5].retweet_count.max()
db.iloc[:5].retweet_count.sum()
db.iloc[:5].retweet_count.idxmax()
user.iloc[:5].tweets_num / user.iloc[:5].created_days
The apply and map functions are also very useful when you want to do the same thing to every element of a Series or row.
def messify(x):
    return (x**2) + x - 5
user.iloc[:5].tweets_num.apply(messify)
user.iloc[:5].tweets_num.map(messify)
The most knotty things are axis and level.
d=pd.DataFrame([['inls101','f12',12,3,2,2],['inls101','f12',12,3],['inls103','f13',12,3,3,6]])
d.columns=['Course','Sem','x','y','z','d']
d=d.set_index(['Course','Sem'])
d
#default axis = 0
d.sum(axis=0)
d.sum(axis=0, level=0)
d.sum(axis=0, level=1)
d.sum(axis=1)
None is a Python singleton object that is often used for missing data in Python code. Because it is a Python object, None cannot be used in any arbitrary NumPy/Pandas array, but only in arrays with data type 'object' (i.e., arrays of Python objects):
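A minimal sketch of that point: an array containing None falls back to dtype object, while np.nan is a float and keeps the array numeric:

```python
import numpy as np

# None forces the generic object dtype
arr_none = np.array([1, None, 3])
print(arr_none.dtype)   # object

# np.nan is a float, so the array stays numeric
arr_nan = np.array([1, np.nan, 3])
print(arr_nan.dtype)    # float64
```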
Pandas treats None and NaN as essentially interchangeable for indicating missing or null values. To facilitate this convention, there are several useful methods for detecting, removing, and replacing null values in Pandas data structures. They are:
isnull(): Generate a boolean mask indicating missing values
notnull(): Opposite of isnull()
dropna(): Return a filtered version of the data
fillna(): Return a copy of the data with missing values filled or imputed
isna() is used to check whether there are missing values
user.tweets_num.isna().head(5)
The opposite is notna()
user.loc[user.geo_coordinates.notna()]
dropna() is used to drop rows or columns that contain missing values
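A quick sketch on a toy DataFrame (made-up values), showing that dropna() drops rows by default and columns with axis=1:

```python
import numpy as np
import pandas as pd

# A toy DataFrame with one missing value (made-up data)
df = pd.DataFrame({"x": [1.0, np.nan, 3.0],
                   "y": [4.0, 5.0, 6.0]})

# Drop any ROW containing a missing value
print(df.dropna())        # keeps rows 0 and 2

# Drop COLUMNS instead of rows
print(df.dropna(axis=1))  # keeps only column "y"
```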
fillna() is used to fill the missing values
db.iloc[:5,15:17]
db.iloc[:5,15:17].fillna(0)
db.iloc[:5,15:17].fillna(method="ffill", axis=1)
Now let me introduce a tool for missing values
import missingno
import seaborn as sns
sns.set()
missingno.matrix(user, labels=True)
missingno.bar(user)
This is another powerful tool that produces a report on your dataset. It's interactive and based on JavaScript.
But for some reason, it could not run on my machine...
import pandas_profiling
pandas_profiling.ProfileReport(db.iloc[:100,:10])
Groupby
Groupby is one of the most widely used data aggregation tools in pandas. Basically, it splits the data into different chunks, then applies a function to each of them, and finally combines the results.

db.groupby(["lang_trans"]).id_str.count()
db.groupby(["lang_trans"]).agg({"id_str":"count"})
There are two kinds of data formats: long and wide.

db.groupby(["lang_trans",'trans_sour']).count().iloc[:,:2]
db.groupby(["lang_trans",'trans_sour']).count().iloc[:,1].unstack()
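On a toy dataset (made-up values), unstack() moves the inner index level into the columns, turning a long groupby result into a wide table:

```python
import pandas as pd

# Long format: one row per (lang, source) pair (made-up data)
long = pd.DataFrame({
    "lang":   ["en", "en", "fr"],
    "source": ["web", "app", "web"],
    "n":      [10, 5, 7],
})

# Long: a Series with a 2-level index
counts = long.groupby(["lang", "source"])["n"].sum()
print(counts)

# Wide: lang as rows, source as columns; missing pairs become NaN
wide = counts.unstack()
print(wide)
```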
Pivot tables are a shortcut for groupby. If you use PivotTables in Excel, you will be familiar with their usage in pandas too.
pd.pivot_table(index="lang_trans", columns="trans_sour", data=db,values="id_str", aggfunc="count",margins=True )
Wait, there is a simpler way if you only want to count the numbers.
Let's introduce crosstab.
pd.crosstab(db["lang_trans"],db["trans_sour"])
This part is also practically oriented, so I don't want to delve into the nitty-gritty of data viz. I only focus on how to make a visualization fast and simply.
There are many ways to make a plot in Python.
from __future__ import print_function, division
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
plt.scatter(x="tweets_num",y="created_days",data=user.fillna(0))
But because it provides many customizable functions, the learning curve is steep and you have to add the elements one by one.
And the default style is so UGLY.
That's why I recommend using Seaborn.
Seaborn is a highly encapsulated package built on Matplotlib. But it is waaaaaaaay more user-friendly and BEAUTIFUL.
It's like Matplotlib with makeups
user.plot(kind="scatter",x="tweets_num",y="created_days")
import seaborn as sns
sns.set()
fig, ax = plt.subplots(figsize=(15, 10))
sns.scatterplot(data=user,x="tweets_num",y="created_days",hue="lang_trans", palette="Set2", size="tweets_num",
alpha=0.3, x_jitter=True, ax=ax).set_title("Test");
There are some interactive plotting packages; the most popular ones are bokeh and plotly.
from bokeh.plotting import figure, output_notebook, show
from bokeh.models import ColumnDataSource
from bokeh.models.tools import HoverTool
output_notebook()
p = figure()
p.circle(x="tweets_num", y="created_days",
         source=ColumnDataSource(user),
         size=10, color='green')
p.title.text = 'test'
p.xaxis.axis_label = 'tweets_num'
p.yaxis.axis_label = 'created days'
hover = HoverTool()
hover.tooltips = [
    ('tweets_num', '@tweets_num'),
    ('created days', '@created_days'),
]
p.add_tools(hover)
show(p)